Retrieval Performance and Visual Dispersion of Query Sets

Author

  • Mark E. Rorvig
Abstract

In the course of eight TREC Conferences, retrieval performance of all systems started high and then declined. This was especially true for conference 5. Only in conferences 7 and 8 have performance levels reached those initially achieved. In this paper, scaling of the corpus of 450 TREC topics is performed. It is observed that as the visual dispersion of a topic set increases, the level of retrieval performance across systems declines for that set. Conversely, as the visual dispersion of topics decreases, system performance rises. In common elements of conferences 2, 5, and 8, this relationship appears to hold despite increases in the number of participating systems in TREC. It is proposed that visual dispersion measures should be used to describe topic set difficulty in addition to measures such as “hardness”.

1 This study was supported by Intel Corporation.
2 A color version of this paper is available at .
3 Correspondence should be sent to the author at [email protected].

Introduction

In the middle of a wonderful review article of the work of Project Intrex from 1965 to 1973, the authors interject this startling phrase: “Our analysis has shown that choice of words used in search strategies has a major influence on retrieval effectiveness” (Overhage and Reintjes, 1974, p. 174). This phrase startles because it is at once a reduction and an enigma. There is no doubt that word choice is important, but how can it be so important that retrieval performance depends upon it to the exclusion of so many other system and architecture considerations?

The query is often the last item considered in IR testing. Usually its study is incorporated in the interaction effects between systems and users; a difficult and fluid arena. The suspicion that queries might establish a system performance limit did not arise in TREC literature until conference 5 (Voorhees and Harman, 1997).
However, it has since been recognized as an area for important study, resulting in the establishment of a query track since conference 7 (Buckley, 1998).

It is difficult to quantify the meaning of topic difficulty. Voorhees and Harman (1997) note that it is weakly (r = 0.33) correlated with the percent of unique relevant documents for that query. In the same volume, Sparck Jones remarks that “...low levels of performance...in TREC 4 and 5 must be taken as representing a more realistic retrieval situation than TREC 2 and 3...” (Sparck Jones, 1997, p. B-2). Sparck Jones comments further in her review of TREC-7 that “Since TREC-7 full topics are shorter than TREC-6, but TREC-7 performance levels are better, the TREC-7 topics are presumably not as hard...However, performance is not as tightly correlated with topic length, and specifically with version, as might be expected...” (Sparck Jones, 1999, p. B-6).

Factors that cannot describe query difficulty are: (1) topic components (concepts, narratives, etc.), (2) topic length, and (3) topic construction (creating topics without regard to existing documents vs. the contrary practice). Document uniqueness is the only quantitative measure so far offered. Indeed, topic hardness appears to rest in that zone of phenomena that many can mutually observe, but cannot describe in terms that would eventually permit control.

This paper proposes an additional quantitative measure for query difficulty. The measure is applicable to sets of topics only, but is based on the scaled similarity of documents by text terms. The proposed measure is replicable, and conforms to observed system performance behavior across three representative TREC conferences.

Methodology

TREC Topics were copied from the trec.nist.gov site and parsed into individual documents. A document similarity matrix was created using the cosine vector measure of similarity.
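The similarity-matrix step can be sketched as follows. This is a minimal illustration, assuming raw term-frequency vectors and a simple alphabetic tokenizer; the text does not state the term weighting or tokenization actually used, and the function names and toy topic strings are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (assumed: lowercase alphabetic tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(texts):
    """Square matrix of pairwise cosine similarities between documents."""
    vectors = [term_vector(t) for t in texts]
    return [[cosine(a, b) for b in vectors] for a in vectors]


# Hypothetical toy topics, not actual TREC topic text.
topics = [
    "airbus subsidies europe",
    "airbus subsidies trade dispute",
    "ozone layer depletion",
]
sim = similarity_matrix(topics)
for row in sim:
    print(["%.3f" % s for s in row])
```

Topics that share terms score closer to 1, and topics with no terms in common score 0, so the matrix captures how tightly a topic set clusters before any scaling is applied.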
The similarity matrix was scaled using the maximum likelihood method customary for text data (Rorvig, 1999a) and plotted using a conventional graphics tool.

Figure 1: Each dot in the illustration above represents a TREC topic. Arrayed from left to right, topic sets reveal increasing dispersion from topic set 3 onward. This effect does not change until topic sets 7 and 8 appear. (Plot title: TREC Topics 1-450 ad hoc.)
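The scaling step and a numeric dispersion measure can be sketched as follows. The sketch substitutes classical (Torgerson) multidimensional scaling for the maximum likelihood method cited above, converts a similarity s to a distance 1 - s, and measures dispersion as the mean distance of the scaled points from their centroid. All three choices, along with the function names and the toy similarity matrices, are assumptions made for illustration and are not taken from the paper.

```python
import math


def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def matvec(m, v):
    """Product of a square matrix and a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def classical_mds(dist, dims=2, iters=500):
    """Classical (Torgerson) MDS by power iteration on a symmetric distance matrix.

    Assumption: a stand-in for the maximum likelihood scaling cited in the text.
    """
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Deterministic, non-constant start vector for the power iteration.
        v = unit([math.sin(i + k + 1.0) for i in range(n)])
        for _ in range(iters):
            w = matvec(b, v)
            if math.sqrt(sum(x * x for x in w)) < 1e-12:
                break  # nothing left to extract
            v = unit(w)
        # Rayleigh quotient: signed eigenvalue estimate for the unit vector v.
        eigval = sum(x * y for x, y in zip(v, matvec(b, v)))
        if eigval > 0:  # non-positive eigenvalues contribute no real coordinate
            for i in range(n):
                coords[i][k] = math.sqrt(eigval) * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


def dispersion(coords):
    """Mean Euclidean distance of points from their centroid (assumed measure)."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(len(coords[0]))]
    return sum(math.dist(p, centroid) for p in coords) / n


def to_distance(sim):
    """Assumed conversion from cosine similarity to distance: d = 1 - s."""
    return [[1.0 - s for s in row] for row in sim]


# Hypothetical similarity matrices for a tightly related and a loosely related topic set.
tight = [[1.0, 0.8, 0.7], [0.8, 1.0, 0.75], [0.7, 0.75, 1.0]]
loose = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.15], [0.1, 0.15, 1.0]]
tight_xy = classical_mds(to_distance(tight))
loose_xy = classical_mds(to_distance(loose))
print("tight set dispersion:", dispersion(tight_xy))
print("loose set dispersion:", dispersion(loose_xy))
```

Under these assumptions, a topic set whose members share few terms produces points that sit farther from their centroid, which gives a number to compare against the spread seen in a plot such as Figure 1.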

Similar resources

Image retrieval using the combination of text-based and content-based algorithms

Image retrieval is an important research field which has received great attention in recent decades. In this paper, we present an approach to image retrieval based on a combination of text-based and content-based features. Keywords are used as text-based features, and color and texture are used as content-based features. A query in this system contains some keywords and an input...

Semiautomatic Image Retrieval Using the High Level Semantic Labels

Content-based image retrieval and text-based image retrieval are two fundamental approaches in the field of image retrieval. The challenges of each approach have led researchers to combine them and to use semi-automatic retrieval, with user interaction in the retrieval cycle. Hence, in this paper, an image retrieval system is introduced that provides two kinds of qu...

QEA: A New Systematic and Comprehensive Classification of Query Expansion Approaches

A major problem in information retrieval is the difficulty of defining the user's information needs; on the other hand, when the user submits a query, there is a vast amount of information to retrieve. Different methods, therefore, have been suggested for query expansion, which is concerned with reconfiguring the query by increasing efficiency and improving the criterion accuracy in the information...

A Method for Relevance Feedback Based on Improving the Similarity Function in Content-Based Image Retrieval

In content-based image retrieval systems, suitable visual features are extracted from images and stored in the feature database. The feature database is then searched to find the images most similar to the query image. In this paper, three types of visual features with 270 components were used for image indexing. Here, we use a weighted distance for similarity measurement between two images....

Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback

Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. In this paper, an approach to document image retrieval based on keyword spotting is proposed, and a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method, is used in this paper to imp...

Publication date: 1999